26  Predicting air pollution levels using a linear regression model

Learning Objectives

After completing this tutorial you should be able to

Download the directory for this project here, make sure the directory is unzipped, and move it to your bi328 directory. You can open the .Rproj file for this module either by double-clicking on it (which will launch RStudio), by opening RStudio and then using File > Open Project, or by clicking on the R project icon in the top right of your program window and selecting Open Project.

There should be a file named 26_linear-regression-models.qmd in that project directory. Use that file to work through this tutorial - you will hand in your rendered (“knitted”) Quarto file as your homework assignment. So, first thing, change the author in the YAML header to your name. You will use this Quarto document to record your answers. Remember to use comments to annotate your code; at minimum you should have one comment per code chunk1, though you may of course add as many comments as you need to be able to recall what you did. Similarly, take notes in the document as we discuss the discussion/reflection questions, but make sure that you go back and clean them up for “public consumption”.

  • 1 You should do this whether you are adding code yourself or using code from our manual, even if it isn’t commented in the manual… especially when the code is already included for you, add comments to describe how the function works/what it does as we introduce it during the participatory coding session so you can refer back to it.

  • Before you get started, let’s make sure to read in all the packages we will need2.

  • 2 Check that all of these packages are installed before you load them!

  • # load libraries ----
    
    # reporting
    library(knitr)
    
    # visualization
    library(ggplot2)
    library(ggthemes)
    library(patchwork)
    
    # data wrangling
    library(dplyr)
    library(tidyr)
    library(readr)
    library(skimr)
    library(janitor)
    library(magrittr)
    library(lubridate)
    
    # modeling
    library(tidymodels)
    library(vip)
    
    # set other options ----
    
    # turn off scientific notation
    options(scipen = 999)
    
    # turn off messages and warnings
    knitr::opts_chunk$set(
      tidy = FALSE, 
      message = FALSE,
      warning = FALSE)
    
    # set the ggplot theme for the document
    theme_set(theme_bw())
    
    # set random seed
    set.seed(42)

    26.1 Building a machine learning model

    We already have a data set with features and outcome variables that we’ve thoroughly explored. Let’s go ahead and read that data set in and make the transformations we concluded are needed so we are ready to start training a model.

    # read & wrangle data set
    pm <- read_delim("data/air_pollution.csv", delim = ",") %>%
      clean_names() %>%
      mutate(across(c(id, fips, zcta), as.factor)) %>%
      mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
                              city != "Not in a city" ~ "In a city"))

    We are going to use the tidymodels ecosystem, which consists of a series of packages that assist in the various steps of the model building process. These packages were designed with the goal of having a standardized workflow and syntax across different types of machine learning algorithms. This also means that it is straightforward to modify pre-processing, algorithm choice, and hyper-parameter tuning for optimization.

    These are the individual steps that we will go through to train our model3.

  • 3 You are already familiar with some of these steps for some simple linear regressions that we have run (Steps 3, 5, 6, 7). Use this list to keep track of where we are in the process.

    1. Split data into testing and training sets.
    2. Create recipe + assign variable roles.
    3. Specify model, engine, and mode.
    4. Create workflow, add recipe & model.
    5. Fit workflow.
    6. Get predictions.
    7. Use predictions to get performance metrics.
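    Mapped onto tidymodels code, those seven steps form a skeleton like the one sketched below. It uses the built-in mtcars data so it is self-contained; for our air pollution data we will develop every piece carefully in the sections that follow, so treat this only as a roadmap.

```r
library(tidymodels)

set.seed(42)
split <- initial_split(mtcars, prop = 2/3)      # 1. split
train <- training(split)
test  <- testing(split)

rec <- recipe(mpg ~ ., data = train)            # 2. recipe + variable roles

mod <- linear_reg() %>%                         # 3. model type, engine, mode
  set_engine("lm") %>%
  set_mode("regression")

wf <- workflow() %>%                            # 4. workflow = recipe + model
  add_recipe(rec) %>%
  add_model(mod)

wf_fit <- fit(wf, data = train)                 # 5. fit the workflow

preds <- predict(wf_fit, new_data = test) %>%   # 6. get predictions
  bind_cols(test)

metrics(preds, truth = mpg, estimate = .pred)   # 7. performance metrics
```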

    26.2 Split the data

    Consider this

    We have previously discussed the problems of underfitting and overfitting. Briefly describe these two terms and explain how each can become a problem when building models.

    Comparison of underfit (left), optimal (middle), or overfit (right) model.
    Did it!

    [Your answer here]

    A good solution to this issue is to split our data into a training and a testing data set. The training data set will be used to build and tune our model. Then we can use the testing data set to determine how well our model describes the relationship between outcome and predictor values.

    Once we have created our testing data set we will set it aside until we have finished optimizing our model with the training set; this minimizes bias when we evaluate the performance of our model.
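    To see why the held-out set matters, here is a small self-contained sketch in base R (simulated sine-shaped data, a made-up example of our own, not the air pollution data): as the polynomial degree grows, training error keeps shrinking, while test error eventually stops improving.

```r
set.seed(1)
x <- runif(120)
y <- sin(2 * pi * x) + rnorm(120, sd = 0.3)
train_idx <- 1:80    # data used to fit each model
test_idx  <- 81:120  # held-out data the models never see during fitting

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# fit increasingly flexible polynomial models and compare errors
for (degree in c(1, 4, 15)) {
  fit <- lm(y ~ poly(x, degree), subset = train_idx)
  cat(sprintf("degree %2d | train RMSE %.3f | test RMSE %.3f\n",
              degree,
              rmse(y[train_idx], predict(fit, data.frame(x = x[train_idx]))),
              rmse(y[test_idx],  predict(fit, data.frame(x = x[test_idx])))))
}
```

Evaluated only on the training data, the most flexible model always looks best; the test set is what reveals overfitting.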

    We will use rsample::initial_split() to randomly subset our data into a training (2/3 of observations) and a testing (1/3 of observations) data set.

    # split sample
    pm_split <- rsample::initial_split(data = pm, prop = 2/3)
    
    # check proportions of split
    pm_split
    <Training/Testing/Total>
    <584/292/876>

    Next, we will extract the testing and training data to create two separate data frames.

    # training data set
    train_pm <- training(pm_split)
    
    # test data set
    test_pm <- testing(pm_split)

    26.3 Prepare for pre-processing

    Now that we have split the data, we need to process the training and testing data to make sure they are compatible and optimized for building the model. This process is called feature engineering and involves assigning variables to specific roles, removing redundant variables if they are present, and scaling data as needed.

    26.3.1 Step 1: Specify variable roles with recipe()

    In the tidymodels framework we can do this by first creating a “recipe” describing all the individual processing steps we want to take - this is especially helpful if we have multiple data sets we are going to work with and/or if we are going to be re-running the processing multiple times.

    First, we will create a recipe that specifies the roles of individual variables, i.e. which are the outcome and which the predictor variables4.

  • 4 Keep in mind that the recipe just describes the steps to take, it does not actually execute them

  • For our data set we want to specify that value (the PM2.5 concentration measured by air pollution monitors) is our outcome variable, and that our features (predictor variables) are all the other variables with the exception of the monitor ID (id). We don’t want to include id as a predictor variable because it is unique to each monitor, so it would only add noise - however, down the line we will still want to have this information, so we keep it in the data set.

    # build recipe
    simple_rec <- recipe(train_pm) %>%
        update_role(everything(), new_role = "predictor") %>%  # specify predictor variables
        update_role(value, new_role = "outcome") %>%           # specify outcome variable
        update_role(id, new_role = "id variable")              # specify id as id variable
    
    simple_rec

    Let’s take a closer look at our recipe

    summary(simple_rec) %>%
      kable()
    variable type role source
    id factor , unordered, nominal id variable original
    value double , numeric outcome original
    fips factor , unordered, nominal predictor original
    lat double , numeric predictor original
    lon double , numeric predictor original
    state string , unordered, nominal predictor original
    county string , unordered, nominal predictor original
    city string , unordered, nominal predictor original
    cmaq double , numeric predictor original
    zcta factor , unordered, nominal predictor original
    zcta_area double , numeric predictor original
    zcta_pop double , numeric predictor original
    imp_a500 double , numeric predictor original
    imp_a1000 double , numeric predictor original
    imp_a5000 double , numeric predictor original
    imp_a10000 double , numeric predictor original
    imp_a15000 double , numeric predictor original
    county_area double , numeric predictor original
    county_pop double , numeric predictor original
    log_dist_to_prisec double , numeric predictor original
    log_pri_length_5000 double , numeric predictor original
    log_pri_length_10000 double , numeric predictor original
    log_pri_length_15000 double , numeric predictor original
    log_pri_length_25000 double , numeric predictor original
    log_prisec_length_500 double , numeric predictor original
    log_prisec_length_1000 double , numeric predictor original
    log_prisec_length_5000 double , numeric predictor original
    log_prisec_length_10000 double , numeric predictor original
    log_prisec_length_15000 double , numeric predictor original
    log_prisec_length_25000 double , numeric predictor original
    log_nei_2008_pm25_sum_10000 double , numeric predictor original
    log_nei_2008_pm25_sum_15000 double , numeric predictor original
    log_nei_2008_pm25_sum_25000 double , numeric predictor original
    log_nei_2008_pm10_sum_10000 double , numeric predictor original
    log_nei_2008_pm10_sum_15000 double , numeric predictor original
    log_nei_2008_pm10_sum_25000 double , numeric predictor original
    popdens_county double , numeric predictor original
    popdens_zcta double , numeric predictor original
    nohs double , numeric predictor original
    somehs double , numeric predictor original
    hs double , numeric predictor original
    somecollege double , numeric predictor original
    associate double , numeric predictor original
    bachelor double , numeric predictor original
    grad double , numeric predictor original
    pov double , numeric predictor original
    hs_orless double , numeric predictor original
    urc2013 double , numeric predictor original
    urc2006 double , numeric predictor original
    aod double , numeric predictor original
    Table 26.1: Roles assigned to each feature in the data set.

    Success! All of our variables now have a specific role: outcome variable, predictor (feature), or ID column.

    26.3.2 Step 2: Specify pre-processing steps

    Our next step is to use the recipes::step_*() functions to specify any necessary pre-processing steps. These could include a variety of transformations, for example filling in missing values (imputation), converting continuous variables into discrete variables (binning), encoding or creating dummy variables, data type conversions, or normalization.

    Because we are in the extended tidyverse, we can use various functions to select the variables to apply steps to:

    1. Use tidyselect methods such as contains(), matches(), starts_with(), ends_with(), everything(), num_range().
    2. Use the data type of a column, e.g. all_nominal(), all_numeric(), has_type()
    3. Use the role assigned to variable (see above) all_predictors(), all_outcomes(), has_role()
    4. We can just use the name of the variable.
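    As a hypothetical illustration (this recipe refers to the train_pm object from above but is not part of our final pipeline), one step of each selection style might look like this:

```r
demo_rec <- recipe(value ~ ., data = train_pm) %>%
  step_normalize(starts_with("imp_a")) %>%   # 1. tidyselect helper
  step_dummy(all_nominal_predictors()) %>%   # 2. data type of the column
  step_corr(all_numeric_predictors()) %>%    # 3. assigned role (+ type)
  step_log(zcta_pop)                         # 4. explicit variable name
```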

    Let’s look at a couple of specific examples for what we need to pay attention to during preprocessing.

    A typical pre-processing step is one-hot encoding, which describes a way that categorical variables are converted to dummy variables (numbers) so that they can be used with algorithms that only take certain data types as input. Because we don’t want the algorithm to interpret these variables as continuous numerical variables, we make it explicit that they are binary (0s and 1s, no order).

    simple_rec %>%
      step_dummy(state, county, city, zcta, one_hot = TRUE)

    The fips column contains a numeric code for state and county, so it is redundant - we should also assign it a role as an ID variable.

    simple_rec %>%
      update_role("fips", new_role = "county id")

    We know from our exploratory analysis that we have a series of variables that are redundant and/or highly correlated with each other; we will want to remove these, which can be done using step_corr(). However, we want to keep some variables (cmaq, aod), so we will explicitly exclude them from being removed.

    simple_rec %>%
      step_corr(all_predictors(), -cmaq, -aod)

    Variables with near zero variance will not be informative and will likely only include additional noise, so we would also want to remove those.

    simple_rec %>%
      step_nzv(all_predictors(), -cmaq, -aod)

    The benefit of the recipes package is that we can create one single recipe that summarizes all the steps we want to take before starting to build our model.

    # create final recipe
    simple_rec <- recipe(train_pm) %>%
        update_role(everything(), new_role = "predictor") %>%
        update_role(value, new_role = "outcome") %>%
        update_role(id, new_role = "id variable") %>%
        update_role("fips", new_role = "county id") %>%
        step_dummy(state, county, city, zcta, one_hot = TRUE) %>%
        step_corr(all_numeric()) %>%
        step_nzv(all_numeric())
      
    simple_rec

    26.4 Run pre-processing

    So far we only have a recipe5. Our next step will be to complete the pre-processing and see how it affects our data set.

  • 5 Think of this as a plan on how we want to pre-process our data

  • 26.4.1 Step 1: Update the recipe with training data using prep()

    The function prep() will update the recipe object based on the training data by estimating parameters for pre-processing and updating the variable roles.

    # update recipe with training data, retain training data set
    prepped_rec <- prep(simple_rec, verbose = TRUE, 
                        retain = TRUE)
    oper 1 step dummy [training] 
    oper 2 step corr [training] 
    oper 3 step nzv [training] 
    The retained training set is ~ 0.27 Mb  in memory.
    names(prepped_rec)
     [1] "var_info"       "term_info"      "steps"          "template"      
     [5] "levels"         "retained"       "requirements"   "tr_info"       
     [9] "orig_lvls"      "last_term_info"

    prepped_rec is a list; the various elements contain a lot of useful information about our training set.

    • steps: contains the pre-processing steps that were run
    • var_info contains the original variable information
    • term_info is the updated variable after pre-processing
    • levels are the new levels, the original levels are in orig_lvls
    • tr_info contains info about the training data set size and completeness
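    For example, we can inspect these elements directly; recipes::tidy() also gives a compact per-step summary (a quick sketch, assuming the prepped_rec object created above):

```r
# summary table of the prepped steps (one row per step)
tidy(prepped_rec)

# variable names, types, and roles after pre-processing
prepped_rec$term_info

# training set size and completeness
prepped_rec$tr_info
```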

    26.4.2 Step 2: Extract the pre-processed training data set using bake()

    We retained our pre-processed training data set using retain = TRUE so we can take a look at our training data using recipes::bake().

    # extract training data set
    baked_train <- bake(prepped_rec, new_data = NULL)
    
    # overview
    glimpse(baked_train)
    Rows: 584
    Columns: 39
    $ id                          <fct> 36001.0012, 19045.0019, 8069.0009, 6029.00…
    $ value                       <dbl> 8.225743, 12.156484, 6.679167, 21.201299, …
    $ fips                        <fct> 36001, 19045, 8069, 6029, 13153, 8013, 390…
    $ lat                         <dbl> 42.68075, 41.82328, 40.57129, 35.38557, 32…
    $ lon                         <dbl> -73.75733, -90.21198, -105.07969, -119.015…
    $ cmaq                        <dbl> 6.934664, 8.476871, 3.848796, 5.747609, 11…
    $ zcta_area                   <dbl> 20537352, 287014236, 36755388, 10828653, 4…
    $ zcta_pop                    <dbl> 11303, 28293, 33409, 12248, 2888, 6578, 19…
    $ imp_a500                    <dbl> 25.23788927, 0.00000000, 28.98961938, 54.6…
    $ imp_a15000                  <dbl> 13.2312860, 2.8746265, 5.9575519, 19.12959…
    $ county_area                 <dbl> 1354056384, 1799821282, 6723613486, 210615…
    $ county_pop                  <dbl> 304204, 49116, 299630, 839631, 139900, 294…
    $ log_dist_to_prisec          <dbl> 4.645390, 7.393982, 6.318517, 5.811309, 6.…
    $ log_pri_length_5000         <dbl> 10.720038, 8.517193, 8.517193, 8.518782, 8…
    $ log_pri_length_25000        <dbl> 13.10303, 12.14475, 12.27618, 11.00095, 13…
    $ log_prisec_length_500       <dbl> 8.320712, 6.214608, 6.214608, 7.931814, 6.…
    $ log_prisec_length_1000      <dbl> 9.652363, 7.600902, 8.783157, 9.415249, 8.…
    $ log_prisec_length_5000      <dbl> 12.022923, 11.109942, 10.758653, 11.446320…
    $ log_prisec_length_10000     <dbl> 13.36628, 12.00213, 12.22083, 12.08351, 11…
    $ log_nei_2008_pm10_sum_10000 <dbl> 5.027644654, 7.022428553, 5.388681313, 5.8…
    $ log_nei_2008_pm10_sum_15000 <dbl> 5.83021624, 7.02829583, 5.63414841, 6.0111…
    $ log_nei_2008_pm10_sum_25000 <dbl> 6.7434101, 7.0489046, 6.6266994, 6.1962012…
    $ popdens_county              <dbl> 224.661250, 27.289376, 44.563835, 39.86555…
    $ popdens_zcta                <dbl> 550.363065, 98.576992, 908.955171, 1131.07…
    $ nohs                        <dbl> 4.3, 4.3, 1.7, 12.5, 0.8, 0.0, 3.9, 4.3, 1…
    $ somehs                      <dbl> 9.1, 8.1, 5.0, 17.0, 7.4, 0.0, 10.2, 12.8,…
    $ hs                          <dbl> 23.2, 38.2, 14.9, 28.6, 14.4, 0.0, 46.6, 3…
    $ somecollege                 <dbl> 14.6, 22.7, 21.7, 20.4, 40.6, 0.0, 21.0, 2…
    $ associate                   <dbl> 4.6, 11.4, 7.7, 8.5, 12.7, 2.1, 8.0, 7.5, …
    $ bachelor                    <dbl> 21.3, 11.2, 29.7, 9.1, 16.5, 85.4, 7.0, 9.…
    $ grad                        <dbl> 22.9, 4.1, 19.3, 3.9, 7.6, 12.5, 3.4, 3.7,…
    $ pov                         <dbl> 27.000000, 10.100000, 9.500000, 31.400000,…
    $ hs_orless                   <dbl> 36.6, 50.6, 21.6, 58.1, 22.6, 0.0, 60.7, 5…
    $ urc2013                     <dbl> 3, 5, 3, 3, 4, 3, 1, 6, 1, 4, 2, 4, 4, 5, …
    $ aod                         <dbl> 42.62500, 48.55556, 29.33333, 62.50000, 35…
    $ state_California            <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
    $ state_Illinois              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
    $ state_Ohio                  <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
    $ city_Not.in.a.city          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …

    Compare this to our original data set

    glimpse(pm)
    Rows: 876
    Columns: 50
    $ id                          <fct> 1003.001, 1027.0001, 1033.1002, 1049.1003,…
    $ value                       <dbl> 9.597647, 10.800000, 11.212174, 11.659091,…
    $ fips                        <fct> 1003, 1027, 1033, 1049, 1055, 1069, 1073, …
    $ lat                         <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 33…
    $ lon                         <dbl> -87.88141, -85.80218, -87.65056, -85.96830…
    $ state                       <chr> "Alabama", "Alabama", "Alabama", "Alabama"…
    $ county                      <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "E…
    $ city                        <chr> "In a city", "In a city", "In a city", "In…
    $ cmaq                        <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9.…
    $ zcta                        <fct> 36532, 36251, 35660, 35962, 35901, 36303, …
    $ zcta_area                   <dbl> 190980522, 374132430, 16716984, 203836235,…
    $ zcta_pop                    <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 901…
    $ imp_a500                    <dbl> 0.01730104, 1.96972318, 19.17301038, 5.782…
    $ imp_a1000                   <dbl> 1.4096021, 0.8531574, 11.1448962, 3.867647…
    $ imp_a5000                   <dbl> 3.3360118, 0.9851479, 15.1786154, 1.231141…
    $ imp_a10000                  <dbl> 1.9879187, 0.5208189, 9.7253870, 1.0316469…
    $ imp_a15000                  <dbl> 1.4386207, 0.3359198, 5.2472094, 0.9730444…
    $ county_area                 <dbl> 4117521611, 1564252280, 1534877333, 201266…
    $ county_pop                  <dbl> 182265, 13932, 54428, 71109, 104430, 10154…
    $ log_dist_to_prisec          <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5.…
    $ log_pri_length_5000         <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9.…
    $ log_pri_length_10000        <dbl> 9.210340, 9.210340, 9.274303, 10.409411, 1…
    $ log_pri_length_15000        <dbl> 9.630228, 9.615805, 9.658899, 11.173626, 1…
    $ log_pri_length_25000        <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 12…
    $ log_prisec_length_500       <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8.…
    $ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9.…
    $ log_prisec_length_5000      <dbl> 10.815042, 10.170878, 11.770407, 10.214200…
    $ log_prisec_length_10000     <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 12…
    $ log_prisec_length_15000     <dbl> 12.205723, 12.042963, 13.282656, 12.353663…
    $ log_prisec_length_25000     <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 13…
    $ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.0…
    $ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.2…
    $ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4.…
    $ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.0000…
    $ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.3500…
    $ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4.…
    $ popdens_county              <dbl> 44.265706, 8.906492, 35.460814, 35.330814,…
    $ popdens_zcta                <dbl> 145.716431, 13.639555, 540.887040, 40.7189…
    $ nohs                        <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7, …
    $ somehs                      <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, 6…
    $ hs                          <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2, …
    $ somecollege                 <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5, …
    $ associate                   <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, 4…
    $ bachelor                    <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17.…
    $ grad                        <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, 2…
    $ pov                         <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7.…
    $ hs_orless                   <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4, …
    $ urc2013                     <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
    $ urc2006                     <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
    $ aod                         <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 43…
    Consider this

    Compare the two outputs and see what has changed. Pay specific attention to the number of variables and their data types. How many predictor variables do we have now?

    Did it!

    [Your answer here]

    We have 39 variables instead of 50 - two are now ID variables and one is the outcome variable, leaving us with 36 predictor variables. We no longer have categorical variables (e.g. state); only the dummy variables state_California, state_Illinois, state_Ohio, and city_Not.in.a.city survived the near-zero variance filter.

    26.4.3 Step 3: Extract the pre-processed testing data using bake().

    We’ve extracted the training data set but we should also apply the recipe we designed to our test data set to check for any issues (e.g. introduction of NA values).

    # extract test data set
    baked_test_pm <- bake(prepped_rec, new_data = test_pm)
    
    # compare effect
    glimpse(baked_test_pm)
    Rows: 292
    Columns: 39
    $ id                          <fct> 1003.001, 1049.1003, 1055.001, 1069.0003, …
    $ value                       <dbl> 9.597647, 11.659091, 12.375394, 10.508850,…
    $ fips                        <fct> 1003, 1049, 1055, 1069, 1073, 1089, 1097, …
    $ lat                         <dbl> 30.49800, 34.28763, 33.99375, 31.22636, 33…
    $ lon                         <dbl> -87.88141, -85.96830, -85.99107, -85.39077…
    $ cmaq                        <dbl> 8.098836, 8.534772, 9.241744, 9.121892, 9.…
    $ zcta_area                   <dbl> 190980522, 203836235, 154069359, 162685124…
    $ zcta_pop                    <dbl> 27829, 8300, 20045, 30217, 21725, 21297, 2…
    $ imp_a500                    <dbl> 0.01730104, 5.78200692, 16.49307958, 19.13…
    $ imp_a15000                  <dbl> 1.4386207, 0.9730444, 5.1612102, 4.7401296…
    $ county_area                 <dbl> 4117521611, 2012662359, 1385618994, 150173…
    $ county_pop                  <dbl> 182265, 71109, 104430, 101547, 658466, 334…
    $ log_dist_to_prisec          <dbl> 4.648181, 3.721489, 5.261457, 7.112373, 4.…
    $ log_pri_length_5000         <dbl> 8.517193, 8.517193, 9.066563, 8.517193, 8.…
    $ log_pri_length_25000        <dbl> 11.32735, 11.90959, 12.01356, 10.12663, 12…
    $ log_prisec_length_500       <dbl> 7.295356, 7.310155, 8.740680, 6.214608, 8.…
    $ log_prisec_length_1000      <dbl> 8.195119, 8.585843, 9.627898, 7.600902, 9.…
    $ log_prisec_length_5000      <dbl> 10.815042, 10.214200, 11.728889, 12.298627…
    $ log_prisec_length_10000     <dbl> 11.88680, 11.50894, 12.76828, 12.99414, 11…
    $ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 0.00000000, 4.43719884, 0.9288…
    $ log_nei_2008_pm10_sum_15000 <dbl> 2.267834, 3.350044, 4.462679, 3.674739, 5.…
    $ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 5.171920, 4.678311, 3.744629, 8.…
    $ popdens_county              <dbl> 44.265706, 35.330814, 75.367038, 67.619664…
    $ popdens_zcta                <dbl> 145.71643, 40.71896, 130.10374, 185.73917,…
    $ nohs                        <dbl> 3.3, 14.3, 4.3, 5.8, 2.8, 1.2, 5.3, 23.9, …
    $ somehs                      <dbl> 4.9, 16.7, 13.3, 11.6, 8.2, 3.1, 14.2, 17.…
    $ hs                          <dbl> 25.1, 35.0, 27.8, 29.8, 32.0, 15.1, 35.7, …
    $ somecollege                 <dbl> 19.7, 14.9, 29.2, 21.4, 28.9, 20.5, 23.8, …
    $ associate                   <dbl> 8.2, 5.5, 10.1, 7.9, 7.4, 6.5, 8.3, 4.3, 5…
    $ bachelor                    <dbl> 25.3, 7.9, 10.0, 13.7, 13.5, 30.4, 8.3, 4.…
    $ grad                        <dbl> 13.5, 5.8, 5.4, 9.8, 7.1, 23.3, 4.5, 2.0, …
    $ pov                         <dbl> 6.1, 13.8, 8.8, 15.6, 5.1, 5.2, 18.4, 29.5…
    $ hs_orless                   <dbl> 33.3, 66.0, 45.4, 47.2, 43.0, 19.4, 55.2, …
    $ urc2013                     <dbl> 4, 6, 4, 4, 1, 3, 3, 1, 1, 3, 2, 2, 6, 2, …
    $ aod                         <dbl> 37.36364, 33.08333, 43.41667, 33.00000, 36…
    $ state_California            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
    $ state_Illinois              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
    $ state_Ohio                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
    $ city_Not.in.a.city          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
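    To run the check for NA values mentioned above, one quick sketch (using the baked_test_pm object we just created) is to count missing values per column; a common cause of NAs at this stage is a factor level in the test set that was absent from the training set:

```r
# count NAs per column and keep only the problematic ones
baked_test_pm %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  filter(n_missing > 0)   # an empty result means no NAs were introduced
```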

    26.5 Specify the model

    Quick reminder of where we are at in the process - here are the steps we’ve taken so far:

    • acquired a data set with an outcome variable and a set of predictor variables
    • split the data set into training and testing sets using the rsample package
    • assigned variable roles, specified and prepped our pre-processing steps, and extracted the pre-processed data sets using the recipes package

    Now it’s time to actually specify our model. The tidymodels package for this is called parsnip.

    There are four things we need to specify about our model:

    1. The model type, e.g. a linear regression as we have done previously, here we will use a random forest approach.
    2. The engine (underlying function/package) to implement the selected model type, e.g. previously we have used lm as our engine to perform a linear regression.
    3. The mode of learning - so far we’ve never explicitly specified this; typically it is either classification or regression.
    4. Additional arguments specific to the specified model or package.

    26.5.1 Step 1: Specify the model type

    Let’s start by specifying the model type as a linear regression

    lm_PM <- linear_reg() # specify model type

    26.5.2 Step 2: Specify the engine

    We want to use the ordinary least squares method to fit our linear regression. There are multiple implementations in various packages, so we need to tell parsnip exactly which function/package to use.

    lm_PM <- linear_reg() %>% # specify model type
      set_engine("lm")        # set engine

    26.5.3 Step 3: Specify mode of learning

    Some packages can do both classification and regression, so we should explicitly specify the mode; in this case we want to perform a regression.

    lm_PM <- linear_reg() %>% # specify model type
      set_engine("lm") %>%    # set engine
      set_mode("regression")  # set mode

    26.6 Define workflow & Fit the model

    Now we’re ready to actually fit the model.

    We are going to use the workflows package to keep track of the pre-processing steps and model specification. Down the line this will help us during the optimization process because we will be able to automate steps, and it will be straightforward to add post-processing operations.

    Let’s create our workflow which will incorporate our recipe for pre-processing and the model we just specified6.

  • 6 We did use prep() to take a look at our data set during pre-processing, but it actually is not a necessary step

  • PM_wflow <- workflows::workflow() %>%
               workflows::add_recipe(simple_rec) %>%
               workflows::add_model(lm_PM)

    We can call up the workflow to get an overview of our model fitting process, including our pre-processing steps and model specification:

    PM_wflow
    ══ Workflow ════════════════════════════════════════════════════════════════════
    Preprocessor: Recipe
    Model: linear_reg()
    
    ── Preprocessor ────────────────────────────────────────────────────────────────
    3 Recipe Steps
    
    • step_dummy()
    • step_corr()
    • step_nzv()
    
    ── Model ───────────────────────────────────────────────────────────────────────
    Linear Regression Model Specification (regression)
    
    Computational engine: lm 

    After all of that, we are now ready to fit the model, i.e. we will estimate the model parameters from our full training data set using parsnip::fit().

    PM_wflow_fit <- fit(PM_wflow, data = train_pm)

    26.7 Assessing the model

    Let’s take a look at our fitted model7.

  • 7 Because we used a workflow, we will first have to extract the fitted model from the workflow and then we can use broom::tidy() to look at the summary table you are already familiar with

  • wflowoutput <- PM_wflow_fit %>% 
      pull_workflow_fit() %>% 
      broom::tidy()
    
    wflowoutput %>%
      kable()
    | term | estimate | std.error | statistic | p.value |
    |------|----------|-----------|-----------|---------|
    | (Intercept) | 182.1641658 | 115.6566032 | 1.5750434 | 0.1158235 |
    | lat | 0.0364521 | 0.0234493 | 1.5545071 | 0.1206408 |
    | lon | 0.0318298 | 0.0093350 | 3.4097334 | 0.0006981 |
    | cmaq | 0.2570252 | 0.0422701 | 6.0805464 | 0.0000000 |
    | zcta_area | 0.0000000 | 0.0000000 | -1.3231225 | 0.1863465 |
    | zcta_pop | 0.0000089 | 0.0000052 | 1.7113227 | 0.0875874 |
    | imp_a500 | 0.0081746 | 0.0070606 | 1.1577786 | 0.2474588 |
    | imp_a15000 | -0.0033947 | 0.0117492 | -0.2889270 | 0.7727465 |
    | county_area | 0.0000000 | 0.0000000 | -0.8312412 | 0.4061994 |
    | county_pop | -0.0000001 | 0.0000001 | -0.7799422 | 0.4357616 |
    | log_dist_to_prisec | 0.1481777 | 0.1056331 | 1.4027586 | 0.1612551 |
    | log_pri_length_5000 | -0.2427120 | 0.1274198 | -1.9048225 | 0.0573264 |
    | log_pri_length_25000 | 0.2344382 | 0.1457027 | 1.6090176 | 0.1081884 |
    | log_prisec_length_500 | 0.2806522 | 0.1613107 | 1.7398243 | 0.0824513 |
    | log_prisec_length_1000 | -0.1065166 | 0.1995609 | -0.5337547 | 0.5937277 |
    | log_prisec_length_5000 | 0.6436848 | 0.3020454 | 2.1310865 | 0.0335264 |
    | log_prisec_length_10000 | -0.1054887 | 0.3243774 | -0.3252035 | 0.7451512 |
    | log_nei_2008_pm10_sum_10000 | 0.0685339 | 0.0790329 | 0.8671565 | 0.3862358 |
    | log_nei_2008_pm10_sum_15000 | 0.0010473 | 0.1041524 | 0.0100555 | 0.9919807 |
    | log_nei_2008_pm10_sum_25000 | 0.1604706 | 0.0810542 | 1.9797943 | 0.0482265 |
    | popdens_county | 0.0000165 | 0.0000625 | 0.2648908 | 0.7911932 |
    | popdens_zcta | -0.0000174 | 0.0000472 | -0.3674112 | 0.7134542 |
    | nohs | -1.8426392 | 1.1563849 | -1.5934481 | 0.1116363 |
    | somehs | -1.8414708 | 1.1572149 | -1.5912955 | 0.1121198 |
    | hs | -1.8180085 | 1.1559237 | -1.5727755 | 0.1163480 |
    | somecollege | -1.8373995 | 1.1559603 | -1.5895005 | 0.1125242 |
    | associate | -1.9058983 | 1.1580924 | -1.6457222 | 0.1003943 |
    | bachelor | -1.8482081 | 1.1558335 | -1.5990262 | 0.1103911 |
    | grad | -1.8463020 | 1.1565096 | -1.5964433 | 0.1109663 |
    | pov | -0.0007694 | 0.0103864 | -0.0740799 | 0.9409739 |
    | hs_orless | NA | NA | NA | NA |
    | urc2013 | 0.2654620 | 0.0904887 | 2.9336476 | 0.0034903 |
    | aod | 0.0202479 | 0.0054056 | 3.7457080 | 0.0001989 |
    | state_California | 3.3887909 | 0.3955334 | 8.5676472 | 0.0000000 |
    | state_Illinois | -0.2502806 | 0.4006262 | -0.6247236 | 0.5324125 |
    | state_Ohio | 0.6901930 | 0.3743078 | 1.8439185 | 0.0657347 |
    | city_Not.in.a.city | 0.1167744 | 0.3062065 | 0.3813584 | 0.7030851 |
    Table 26.2: Overview of effect of individual predictor variables on the model.

    With a model that has this many predictor variables it can be helpful to understand which are the most important, i.e. which have the strongest predictive value.

    The function vip::vip() creates a barplot of the variable importance scores of the predictor variables, ordered from most to least important.

    # pull top 10 most important variables
    PM_wflow_fit %>% 
      pull_workflow_fit() %>% 
      vip(num_features = 10)

    Figure 26.1: Top 10 variables with strongest impact on the model.
    Consider this

    Determine the most important predictor variables for this model. Discuss why you think these features have such a large impact on the predictions. Discuss whether you think there are direct causative links or they are good predictors for other reasons.

    Did it!

    [Your answer here]

    26.8 Evaluate model performance

    Now that we have a fitted model, we need to determine how well it actually performs. To do this, we compare the predicted estimates to the observed values of the outcome variable8.

  • 8 Recall that a machine learning optimization problem tries to minimize the distance between the predicted outcome \(\hat{Y} = f(X)\) and the actual outcome \(Y\), using the predictor variables \(X\) as input for the function \(f\) that we want to estimate.

  • We can use the function broom::augment() to pull the observed and fitted values from our workflow.

    fit_PM <- PM_wflow_fit %>%
      pull_workflow_fit()
    
    fit_PM <- augment(fit_PM$fit, data = baked_train) %>%
      select(value, .fitted:.std.resid)
    
    fit_PM
    # A tibble: 584 × 6
       value .fitted   .hat .sigma   .cooksd .std.resid
       <dbl>   <dbl>  <dbl>  <dbl>     <dbl>      <dbl>
     1  8.23   11.1  0.0279   1.95 0.00182       -1.51 
     2 12.2    11.8  0.0317   1.95 0.0000407      0.212
     3  6.68    8.18 0.0370   1.95 0.000657      -0.784
     4 21.2    12.8  0.0523   1.92 0.0303         4.44 
     5 11.7    10.9  0.0678   1.95 0.000387       0.437
     6  6.49    8.24 0.207    1.95 0.00736       -1.01 
     7 11.9    12.7  0.0532   1.95 0.000238      -0.391
     8 10.5     8.80 0.0458   1.95 0.00109        0.904
     9  8.64   10.4  0.0536   1.95 0.00138       -0.937
    10 11.2    12.7  0.0410   1.95 0.000769      -0.805
    # ℹ 574 more rows

    Let’s compare the fitted (predicted) outcome values (fitted values) \(\hat{Y}\) to the observed outcome values \(Y\).

    ggplot(fit_PM, aes(x = value, y = .fitted)) +
      geom_point() +
      geom_abline(slope = 1) +
      coord_fixed(ratio = 1) +
      scale_y_continuous(limits = c(5, 20)) +
      labs(x = "observed outcome values", y = "predicted outcome values")

    Figure 26.2: Comparison of observed and fitted PM2.5 values. Diagonal line indicates perfect match of fitted and observed values.

    Our range of predicted outcomes appears to be smaller than the range of actual observed values.

    We can quantify our model performance using the root mean square error (RMSE)9.

  • 9 Recall that RMSE is the square root of the mean of the squared differences between fitted and observed values: \(RMSE = \sqrt{\frac{\sum_{i=1}^{n}({\hat{y}_{i}-y_{i}})^{2}}{n}}\)

  • We can calculate RMSE using yardstick::rmse().

    # calculate rmse
    fit_PM %>%
      rmse(truth = value, estimate = .fitted)
    # A tibble: 1 × 3
      .metric .estimator .estimate
      <chr>   <chr>          <dbl>
    1 rmse    standard        1.89
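To connect the formula to the code, RMSE can also be computed by hand. Here is a minimal base-R sketch using made-up observed and fitted values (hypothetical numbers, not our monitoring data); applying the same computation to the value and .fitted columns of fit_PM should reproduce the yardstick estimate.

```r
# toy observed and fitted values (hypothetical, for illustration only)
obs <- c(10, 12, 8, 15)
fit <- c(11, 11, 9, 14)

# RMSE by hand: square root of the mean squared difference
sqrt(mean((fit - obs)^2))
# returns 1
```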

    26.9 Cross validation

    We previously realized that, without context, RMSE is not that helpful: the smaller the value the better, but small compared to what? One option is to check whether the RMSE for our training data set is comparable to that for our test data set, which would indicate that the model works similarly well for both. However, we only get to use our test data set once - at the very end, after we have fit and tuned all the parameters of our model - which means we don't get to use it during the optimization process.

    This poses a problem that we solve using the process of cross-validation. To do this we further split our training data set into multiple data sets for a better assessment and optimization of our model before turning to our test data set.

    While there are several methods for cross-validation, we will use k-fold (or v-fold) cross-validation, which is straightforward but effective. To do this, we split our data into \(k\) equally sized subsets (folds). We then fit the model on all but one of these subsets and assess its performance on the subset that was left out (the holdout set). We repeat this process until every subset has been left out once.
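    The fold assignment itself can be sketched in a few lines of base R (a toy illustration with made-up n and k; below we will use rsample to do this properly):

```r
# toy k-fold assignment (hypothetical n and k, for illustration only)
set.seed(1)
n <- 12
k <- 4
fold <- sample(rep(1:k, length.out = n))  # randomly assign each row to a fold
table(fold)                               # each fold contains n / k = 3 rows
# iteration i: fit on rows with fold != i, assess on rows with fold == i
```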

    We are going to ignore the spatial dependence in our data set and randomly split our training data into four cross-validation folds. A more involved version of this would leave out blocks of monitors based on geography, to test whether geography impacts the performance of the model.

    We can use rsample::vfold_cv() to create the cross-validation folds10.

  • 10 We will create 4 folds; this is low, typically you would use 10.

  • # create four subsets
    kfold_pm <- rsample::vfold_cv(data = train_pm, v = 4)
    
    kfold_pm
    #  4-fold cross-validation 
    # A tibble: 4 × 2
      splits            id   
      <list>            <chr>
    1 <split [438/146]> Fold1
    2 <split [438/146]> Fold2
    3 <split [438/146]> Fold3
    4 <split [438/146]> Fold4
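    The [438/146] notation above shows the analysis/assessment split within each fold; with 584 training rows and v = 4 the numbers work out as:

```r
n_train <- 584         # rows in train_pm
v <- 4
n_train / v            # rows held out for assessment in each fold: 146
n_train - n_train / v  # rows used for fitting in each fold: 438
```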

    We can use tune::fit_resamples() to perform the cross-validation assessment. This automates the process: it will first fit the model using folds 1-3 and assess its performance using fold 4, then fit the model using folds 1, 2, and 4 and assess with fold 3, and so on. There are always as many iterations as there are folds.

    xVal <- fit_resamples(PM_wflow, kfold_pm)

    Our next step would be to take a look at a set of performance metrics based on the cross-validation fits. For example, tune::show_best() will calculate the mean RMSE across all four folds.

    show_best(xVal, metric = "rmse")
    # A tibble: 1 × 6
      .metric .estimator  mean     n std_err .config             
      <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
    1 rmse    standard    1.99     4  0.0695 Preprocessor1_Model1

    We just built a predictive model using linear regression. This is an example of what we generally refer to as a supervised machine learning model11. You have probably primarily used linear regression as a statistical model, i.e. to make inferences about a population as a whole and to understand relationships between dependent and independent variables. But it can also be used for predictive modeling, where we train the model on a set of features. However, the more predictor variables we include, the more difficult it becomes to calculate the coefficients.

  • 11 Supervised models use labeled data to “supervise” the building of the model.

  • Random Forest is a decision-tree method that can also be used for regression analysis; it too is a supervised ML model. We will explore what this looks like in the next chapter. You will quickly see that the individual steps and even most of the code stay the same, with the main difference being that we define a different model and engine.